This notebook explores recent data from the NYC Open Data motor vehicle collisions dataset.
The dataset can also be reached and queried through its Google BigQuery location.
Reduce Car Accidents in Brooklyn
For this exercise, we’d like you to analyze data on New York motor vehicle collisions and answer the following question:
What are your ideas for reducing accidents in Brooklyn?
Imagine you are preparing this presentation for the city council, which will use it to inform new legislation and/or projects.
Briefly:
Libraries that will be used during exploration:
library(magrittr)
library(dplyr)
library(ggplot2)
library(viridis)
library(plotly)
library(maps)
library(rgeos)
library(rgdal)
library(ggthemes)
library(crosstalk)
library(leaflet)
library(d3scatter)
library(d3heatmap)
library(rnoaa)
library(DT)  # provides datatable(), used below
#library(ggmap)
Collect API tokens from environment variables (purposefully kept hidden here). Tokens and keys used include a Google Maps API key (get one here), a Mapbox access token (get one here), and an NCDC token (here) for NOAA weather data.
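As a minimal sketch of that setup (the variable names here are illustrative assumptions, but the environment variable names are the ones the packages actually look for): plotly's `plot_mapbox()` reads `MAPBOX_TOKEN`, and rnoaa falls back to `NOAA_KEY` when no `noaakey` option is set.

```r
# Tokens are read from environment variables so they never appear in the notebook.
# They could be set in ~/.Renviron, e.g.:
#   MAPBOX_TOKEN=pk.xxxx
#   NOAA_KEY=xxxx
mapbox_token <- Sys.getenv("MAPBOX_TOKEN")
noaa_token   <- Sys.getenv("NOAA_KEY")

# Fail early and loudly if either token is missing.
stopifnot(nzchar(mapbox_token), nzchar(noaa_token))
```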
Load data from the /data directory into memory:
dt <- read.csv(file = "data/NYPD_Motor_Vehicle_Collisions.csv")
Inspect structure of dataset with the str() command:
str(dt)
Inspect summary of dataset with summary() command:
datatable(as.data.frame(summary(dt)), colnames = c('Observed Variable' = 'Var2',
                                                   'Quantitative Measures' = 'Freq'))
The dataset structure reveals each variable and its class:

sapply(names(dt), function(x) paste0(x, ' is class: ', class(dt[[x]])))
The first thing that comes to mind with such a factor-heavy dataset is counting. Factors are not a hodgepodge of values picked up off the ground as observed: each level of a factor should, ideally, be intentionally designed, ordered and distributed according to a purpose larger than any single observation. Though not related as quantifiable measurements, the levels of a factor relate to one another as well as to the group as a whole.
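As a minimal sketch of that counting idea (assuming the `BOROUGH` column name from this dataset), a dplyr tally by factor level might look like:

```r
# Count collisions per borough, dropping blank/NA entries,
# and sort so the most accident-prone borough comes first.
dt %>%
  filter(!is.na(BOROUGH), BOROUGH != '') %>%
  count(BOROUGH, sort = TRUE)
```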
There are a lot of empty cells. To ensure a universal value for blank or not-available entries, we assign NA to every blank cell. While munging the data, we also add what could be a valuable variable derived from two existing ones. The DATE and TIME variables are stored as factors; this has interesting categorical value, so they stay in the data. Rather than replacing those two columns, we add a third in POSIX date form.
dt[dt == ''] <- NA
# create date.time column for time use
#dt$date.time <- as.Date(dt$DATE, format("%m/%d/%Y"))
# use hours & min in date.time var for calculations
dt <- within(dt, {date.time = as.POSIXct(strptime(paste(DATE, TIME), "%m/%d/%Y %H:%M"))})
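With date.time in POSIXct form, time-of-day patterns become a simple grouping exercise. A sketch (the plot styling is illustrative only):

```r
# Extract the hour of day from the new POSIXct column and
# plot collision counts per hour across the whole dataset.
dt %>%
  mutate(hour = as.integer(format(date.time, "%H"))) %>%
  filter(!is.na(hour)) %>%
  count(hour) %>%
  ggplot(aes(x = hour, y = n)) +
  geom_col() +
  labs(x = "Hour of day", y = "Collisions")
```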
With Latitude and Longitude present and appearing to be fairly well documented, let's take a quick look at how these accidents appear over an interactive world map (in case of mistakes lying somewhere outside New York). We will use the BOROUGH variable as a factor. This gives the geographic association of each borough and allows us early foresight into anything specific about our point of interest, BOROUGH == "BROOKLYN".
mp <- dt %>%
plot_mapbox(lat = ~LATITUDE, lon = ~LONGITUDE,
split = ~BOROUGH, mode = 'scattermapbox') %>%
  layout(mapbox = list(zoom = 9,
                       center = list(lat = 40.7, lon = -74.0)))
plotly_build(mp)
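Before trusting the map, it may be worth sanity-checking the coordinates numerically. A sketch that flags points outside a rough NYC bounding box (the bounds are assumptions, chosen loosely around the five boroughs; this also catches the zero-coordinate rows common in this dataset):

```r
# Rough NYC bounding box (approximate, for outlier detection only).
bad_coords <- dt %>%
  filter(!is.na(LATITUDE), !is.na(LONGITUDE)) %>%
  filter(LATITUDE  < 40.4 | LATITUDE  > 41.0 |
         LONGITUDE < -74.3 | LONGITUDE > -73.6)

nrow(bad_coords)  # how many points fall outside the box
```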